Chapter 1

Addressing the Audience 


Madam Rector Magnificus,

Mr. Dean and Members of the Board of the Faculty of Mathematics
and Natural Sciences,

Ladies and gentlemen professors,

Ladies and gentlemen of the academic and support staff,

Ladies and gentlemen students,

And furthermore all of you who honor this ceremony with your presence,

This form of address forms the bridge between the past and the present.

The text is delivered in English.

On 4 September 2014, shortly after the start of the academic
year, the Leiden Centre of Data Science (LCDS) was officially opened,
in this same historic building. On that day the university recognized
the importance of this new field of science. The opening speeches
were by Trevor Hastie, professor of Mathematical Sciences at Stanford
University, and by Prince Constantijn van Oranje, who told us
about his work for the Digital Agenda of the European Commission.
Many people have since then stressed the importance of data. By
the end of 2014 both the European Commission and the Netherlands
government named data a key asset for the economy. Both
established a data strategy, and both announced substantial funding
for data science research. Data may well have been the second
most used word of last year.

Today, everybody and everything produces data. People produce
large amounts of data in social networks and in commercial
transactions. Medical, corporate, and government databases continue to
grow. Ten years ago there were a billion Internet users. Now there
are more than three billion, most of whom are mobile.[1] Sensors
continue to get cheaper and are increasingly connected, creating an
Internet of Things. The next three billion users of the Internet will
not all be human, and will generate a large amount of data.

In every discipline, large, diverse, and rich data sets are emerging,
from astrophysics, to the life sciences, to medicine, to the
behavioral sciences, to finance and commerce, to the humanities and
to the arts. In every discipline people want to organize, analyze,
optimize and understand their data to answer questions and to deepen
insights.

The availability of so much data and the ability to interpret it
are changing the way the world operates. The number of sciences
using this approach is increasing. The science that is transforming
this ocean of data into a sea of knowledge is called data science.
In many sciences the impact on the research methodology is
profound; some even call it a paradigm shift.


[1] Source: http://internetlivestat.com





Chapter 2 

Ebola as Data Challenge 


First I will address the question of why there is so much interest
in data. I will answer this question by discussing one of the most
visible recent challenges to public health, the 2014 Ebola outbreak
in West Africa.

2.1 United Nations Global Pulse 

Aid organizations recognize the necessity of correct information for
effective humanitarian aid, especially when disasters have disrupted
the functioning of government institutions. The United Nations has
started Global Pulse, a flagship data science initiative of
Secretary-General Ban Ki-moon. The goal of Global Pulse is to accelerate
discovery, development and adoption of data science innovations for
sustainable development and humanitarian action.[1] Global Pulse
was started in 2014.

The United Nations issued a report to the Secretary-General entitled
A World That Counts: Mobilising the Data Revolution for Sustainable
Development. At the presentation of the report, the co-chair
of the Expert Group, Enrico Giovannini, noted that: “We live in
a world that faces rapidly-evolving humanitarian and development
challenges, as the Ebola epidemic so tragically proves. Governments,
companies, NGOs and individuals need good data to know
where problems are, how to fix them, and if the solutions are working.
But current data are not good enough. Too many people and
issues are not counted or measured, there are huge and growing
inequalities between the information-rich and the information-poor.”

[1] http://unglobalpulse.org


2.1.1 The 2014 Ebola Outbreak 


Outbreaks of contagious diseases lead to severe disruptions of
society, not to mention the loss of life in all its tragedy. The history of
the fight against diseases shows some successes, such as against
the plague and pox, but there are still many diseases that we have
not been able to eradicate, such as malaria, tuberculosis, and
influenza.

Ebola is one such unsolved disease. The first reported outbreak
of Ebola was in 1976, in the rural village of Yambuku in Zaire, 100
km from the white water river, or Ebola, as it is called in the
local tongue. The disease was so terrible that, to avoid stigma to
the villagers of Yambuku, it was named after the far away river
instead (cf., Wordsworth (2014)). Small outbreaks of Ebola have been
occurring regularly in the past. They have been reported in Zaire
and Sudan, in 1995, in 2000, in 2003, in 2007, and in 2012. The
most recent outbreak is the 2014 outbreak, in West Africa. This was
the first outbreak in an urban environment. The outbreak started
in Guinea, and spread to Liberia and Sierra Leone.

Researchers traced the outbreak to a two-year old child who died
in December 2013. In this outbreak, half of the people who suffered
from the disease died. As this outbreak occurred in an urban
environment, it spread much quicker than previous outbreaks, and
caused more fatalities. By early 2015 the number of fatalities had
reached 10,000. As the outbreak progressed, many hospitals, short on
staff and supplies, became overwhelmed and closed, leading health
experts to state that this may be causing a death toll that is likely
to exceed that of the disease itself. Hospital workers are especially
vulnerable to catching the disease since they can easily come into
contact with highly contagious body fluids. The World Health
Organization (WHO) reported that ten percent of the dead have been
healthcare workers.

The virus is thought to reside in fruit bats. As of this writing,
there are no approved vaccines or adequate treatments for Ebola,
although trials are under way. The disease spreads between humans
by contact with bodily fluids, such as blood or sweat. The
incubation period is long, between one and three weeks. This long
incubation period is one of the factors that allow the disease to spread
so effectively. Furthermore, Ebola symptoms initially resemble the
flu or malaria. The outbreak happened in countries with a poor
health infrastructure (Van de Walle & Comes 2014). Infected people
are often misdiagnosed, are not treated, and thus unknowingly
infect healthy people.

Past outbreaks were brought under control within a few weeks;
the 2014 Ebola outbreak is the first one to reach epidemic
proportions. The epidemic has a significant economic effect. People are
fleeing from affected areas, creating a refugee problem and
weakening the economy. Movement of people away from affected areas
has disturbed agricultural activities. The UN Food and Agriculture
Organisation (FAO) warned that the outbreak endangered harvest
and food security in West Africa. Liberia and Sierra Leone struggled
and initially failed to contain the disease. On 8 August 2014, the
World Health Organization declared the outbreak an international
emergency.

The lack of reliable data is a serious contributing factor to the
2014 Ebola outbreak, according to the World Health Organization.
Humanitarian aid agencies cannot respond appropriately;
misinformation leads to widespread fear among the population.


2.1.2 Three Data Challenges 

To address the lack of data, innovative data analysis methods can
help. They can improve the reliability of data, and help to
reduce the effects of tragedies, as the United Nations report on
the Global Pulse indicates. In this way, data science is changing
the way that humanitarian problems are solved in our world.

As this lecture is being prepared in early 2015, aid workers and
scientists have worked hard to contain the effects of the Ebola
outbreak. There are three main challenges for data scientists who are
attempting to resolve the tragedy. These are: (1) diagnosis,
(2) epidemiological spread, and (3) treatment and drug discovery. I will
now discuss these challenges.


2.2 Diagnosis 

The first challenge is in diagnosing Ebola. Conventional diagnostic
tests require specialised equipment and highly trained personnel.
There are few suitable testing centers in West Africa, which leads
to delays in diagnoses. In December 2014, a WHO conference in
Geneva aimed to work out which diagnostic tools could be used to
identify Ebola reliably and more quickly. The meeting sought to
identify tests that can be used by untrained staff, that do not require
electricity or can run on batteries or solar power, and that use
reagents that can withstand temperatures of 40°C. On December 29, 2014,
the US Food and Drug Administration approved a test for use on patients
with symptoms of Ebola.


2.2.1 Reliable Health Data 


The difficulty in diagnosing Ebola is one of the reasons for the
disease to spread unnoticed. Doctors and hospitals are underequipped
and therefore underreport Ebola cases. Months passed between the
first Ebola case and its reporting. Data scientists have worked to
address the unreliability of Ebola data. Northeastern University
has published an online model which assesses the progression
of the epidemic based on simulations of a typical epidemic spread.
The analysis is presented as a live paper that is continuously
updated with new data, projections and analysis (Gomes et al. 2014).

To acquire more reliable data, efforts have moved to crowdsourcing
initiatives that use mobile phones and SMS service. Since the
SMS and voice data are location-specific, it is possible to create
maps that correlate public sentiment to location. Others have
created cheap alternative diagnostic tools, such as checklist apps for
smartphones.[2] The apps may reduce fear and uncertainty among the
population, possibly reducing the refugee problem and its disruptive
effect on the fragile economies (cf., Parejo & Maestre (2015)).


2.2.2 Open Data, Open Government 

Knowledge is power, and governments and organizations are often
protective of their data (see, e.g., Van de Walle & Comes (2014)).
However, as the scale of the outbreak became clear, governments
and organizations started to cooperate in exchanging data on the
disease. Many organizations eventually joined open data initiatives
that allowed scientists access to their data, to be combined with
other open data sources.

[2] http://www.appsagainstebola.org/

In the midst of the fast-moving crisis, traditional methods for
solving problems did not move fast enough. Volunteer efforts have
sprung up in Africa and around the world in a combination of open
data, analytics software, and crowdsourcing. IBM has set up an
African Open Data Initiative to help African countries tap open data
as a means of addressing health, infrastructure and economic
challenges. The World Health Organization provided data. In New York
a grassroots Ebola Open Data Jam was organized.[3] The UN Office
for the Coordination of Humanitarian Affairs set up a Humanitarian
Data Exchange.[4] The government of Sierra Leone created its own
Open Data initiative.[5] The Ebola epidemic prompted the government
of Liberia to make government data available to the outside world. In
this way it facilitated the step towards Open Government.[6]

These open data initiatives are of great value since they allow
different scientists to work on the data, to combine data sources,
and to improve their models. For example, one of the findings of the
project for the Global Data on Events, Location and Tone (GDELT)
is that a global monitoring of internet and media news can provide
a picture of the unfolding of the outbreak that is as accurate as
ground truth data,[7] only much faster.[8]

The availability of different data sources allows data to be
triangulated, or cross-checked, which improves data quality. Models
have been made to visualize the spread of the disease using heat
maps that correlate locations to public sentiment, migration,
infections, and fatalities. Special tools, such as the Spatiotemporal
Epidemiological Modeler tool, are designed to help scientists and
public health officials create real time models of emerging infectious
diseases.[9]

[3] http://eboladata.org
[4] https://data.hdx.rwlabs.org
[5] http://www.ogi.gov.sl
[6] http://www.opengovpartnership.org/country/liberia
[7] https://keystoneaccountability.wordpress.com/tag/nick-van-praag/
[8] http://gdeltproject.org
[9] https://www.eclipse.org/stem/
2.3 Epidemiological Spread


The second challenge is how to model reliably the spread of the
epidemic. Epidemiologists traditionally have to rely on anecdotal
information, on-the-ground surveys, and police and hospital
reports. This type of data is often collected too slowly to curb the
spread of the disease. Scientists have been working under time
pressure to develop novel methods to map the spread more quickly
and more precisely. Below I discuss two methods, viz. analyzing
mobile phone data (subsection 2.3.1) and improving contact tracing
(subsection 2.3.2).


2.3.1 Mobile Phone Data 

The first method is to analyze mobile phone data. Mobile phones
are nowadays widely owned in even the poorest countries in Africa.
They are a rich source of data in a region where only a few other
reliable sources are available. Orange Telecom in Senegal handed
over anonymized voice and text data from 150,000 mobile phones
to a Swedish non-profit organization, whose data analysts drew up
detailed maps of typical population movements in the region.
Authorities and aid workers could then see where the best places were
to set up treatment centers. Authorities also used this information
to find the most effective ways to restrict travel in an attempt to
contain the disease.

A second way in which phone data is used is by tracking the
number of calls to helplines. A sharp increase from one particular
area could suggest an outbreak and alert authorities to direct more
resources to that area. Software companies are helping to visualize
this data and to overlay it with other existing sources of ground
truth data to build up a richer picture.
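The idea of watching helpline calls for a sharp increase can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name, the seven-day window, the threshold of three standard deviations, and the call counts are all hypothetical, and real surveillance systems may use quite different rules.

```python
from statistics import mean, stdev

def flag_call_spikes(daily_calls, window=7, threshold=3.0):
    """Return indices of days whose call count is unusually high.

    A day is flagged when its count exceeds the mean of the preceding
    `window` days by more than `threshold` sample standard deviations.
    """
    flagged = []
    for day in range(window, len(daily_calls)):
        history = daily_calls[day - window:day]
        baseline = mean(history)
        # Fall back to 1.0 when the history is completely flat.
        spread = stdev(history) or 1.0
        if daily_calls[day] > baseline + threshold * spread:
            flagged.append(day)
    return flagged

# Hypothetical helpline counts for one district (not real data).
calls = [12, 10, 11, 13, 12, 11, 12, 13, 41, 12]
print(flag_call_spikes(calls))  # [8]
```

A flagged day is only a prompt for further investigation; as the text notes, such signals become more useful when combined with other sources.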

Mobile phone data can be used to improve the accuracy of
epidemiological models. Epidemiology uses advanced statistics to model
the spread of a disease, often based on historical data, the level of
contagiousness of the disease, and on behavioral factors. Dynamic
models combine historical data with current field measurements.
The dynamic models can be more precise in their prediction of the
spread of a disease than static historical models. The difference in
accuracy can be large, with serious consequences for policy makers.

As a case in point, we mention a report of September 2014 by the
Centers for Disease Control and Prevention (CDC). It analyzes the
impact of underreporting and suggests correction of case numbers by
a factor of up to 2.5. With this correction factor, approximately
21,000 total cases were estimated for the end of September 2014 in
Liberia and Sierra Leone alone. The same report predicts that the
number of cases could reach 1.4 million in Liberia and Sierra Leone
by the end of January 2015. Two months later, at a congressional
hearing, the director of the CDC said that the number of Ebola cases
was no longer expected to exceed 1 million, moving away from the
worst-case scenario that had been previously predicted.
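The arithmetic behind such a correction factor is simple and can be written out. The sketch below assumes that the correction is a plain multiplication of reported cases; the function names are hypothetical, and the reported count it derives (21,000 divided by 2.5) is an inference from the figures above, not a number taken from the report.

```python
def corrected_cases(reported_cases, correction_factor=2.5):
    """Scale a reported case count to account for underreporting."""
    return reported_cases * correction_factor

def implied_reported(estimated_cases, correction_factor=2.5):
    """Invert the correction: which reported count yields the estimate?"""
    return estimated_cases / correction_factor

def implied_unreported_share(correction_factor=2.5):
    """Share of all cases that go unreported under this factor."""
    return 1.0 - 1.0 / correction_factor

# An estimate of about 21,000 cases at a factor of 2.5 corresponds
# to roughly 8,400 reported cases (derived here, not quoted).
print(implied_reported(21_000))    # 8400.0
print(corrected_cases(8_400))      # 21000.0
print(implied_unreported_share())  # about 0.6
```

A factor of 2.5 thus amounts to assuming that about three in five cases never reach the official counts.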

New data allow new mathematical models to be validated. One
model that has attracted attention is the IDEA model, a straightforward
two parameter mathematical model that appears to model the
spread of the disease well (cf., Fisman et al. 2014).
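To give a feel for how a two parameter model can describe an outbreak, here is a minimal sketch. It assumes the commonly cited form of the IDEA model, I(t) = (R0 / (1 + d)^t)^t, where t counts disease generations, R0 is the basic reproduction number, and d is a discount factor that dampens growth over time. This formula is not given in the lecture itself, and the parameter values below are purely illustrative.

```python
def idea_incidence(t, r0, d):
    """Incident cases in generation t under the assumed IDEA form.

    t  : generation number (0, 1, 2, ...)
    r0 : basic reproduction number
    d  : discount factor; d = 0 gives pure exponential growth r0**t
    """
    return (r0 / (1.0 + d) ** t) ** t

# Illustrative parameters only, not fitted to any outbreak data.
for generation in range(6):
    undamped = idea_incidence(generation, r0=1.8, d=0.0)
    damped = idea_incidence(generation, r0=1.8, d=0.02)
    print(generation, round(undamped, 2), round(damped, 2))
```

With d set to zero the model reduces to exponential growth; any positive d makes the epidemic curve bend downward, which is what allows such a small model to follow an outbreak that is being brought under control.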

Access to real time data, such as the measurement of migration
patterns through mobile phone tracking, is of great value to
improve epidemiological models. Incorporating data from different
sources into simulation models allows data triangulation to predict
the spread of the disease better. Improving the accuracy of
statistical models is important not only for better targeting of relief
work, but also for improving the reputation of aid organizations as
providers of trustworthy information.


2.3.2 Contact Tracing 

The second method for better mapping the spread of the disease is
to improve contact tracing. Contact tracing is an important method
(1) for understanding the spread of Ebola and (2) for acquiring
correct numbers on the size of the epidemic. Contact tracing requires
effective community surveillance so that a possible case of Ebola
can be registered and diagnosed. Subsequently everyone who has
had close contact with the patient must be found and tracked for
21 days. This requires careful record keeping and many properly
trained and equipped staff. There is a substantial effort to train
volunteers and health workers, sponsored by USAID. According to
WHO reports, 25,926 contacts from Guinea, 35,183 from Liberia
and 104,454 from Sierra Leone were listed and being traced at the
end of 2014.

Contact tracing is labor intensive. Patients are interviewed, and
their relatives are contacted, to establish how they were likely
infected in the preceding period and whom they could have infected.
Contacts are watched for 21 days, to see whether they develop
symptoms of the illness. Thus, a social graph of the patient is built.
By combining social graphs of people in an area an overall view of
the network of the disease in a certain area and time period can be
created.[10]
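Combining the social graphs of individual patients into one network can be sketched as follows. This is a minimal illustration, assuming that each interview yields a plain list of named contacts; the function names and the interview data are hypothetical, and real tracing records carry much more detail (dates, type of contact, follow-up status).

```python
from collections import defaultdict

def merge_contact_lists(contact_lists):
    """Merge per-patient contact lists into one undirected graph."""
    graph = defaultdict(set)
    for patient, contacts in contact_lists.items():
        for contact in contacts:
            graph[patient].add(contact)
            graph[contact].add(patient)
    return graph

def most_connected(graph, top=3):
    """People with the most contacts, ties broken alphabetically."""
    ranked = sorted(graph, key=lambda person: (-len(graph[person]), person))
    return ranked[:top]

# Hypothetical interview results (not real data).
interviews = {
    "P1": ["A", "B", "P2"],
    "P2": ["B", "C", "D", "E"],
    "P3": ["E"],
}
graph = merge_contact_lists(interviews)
print(most_connected(graph, top=2))  # ['P2', 'P1']
print(sorted(graph["B"]))            # ['P1', 'P2']
```

In the merged graph, contact B turns out to be linked to two patients, which no single interview would have shown on its own.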

Estimating the spread of the disease is difficult. A study
published in December 2014 by Scarpino et al. (2014) found that
transmission of the Ebola virus occurs principally within families, in
hospitals and at funerals. The data, gathered during three weeks of
contact tracing, showed that the third person in any transmission
chain often knew both the first and second person. The authors
estimated that between 17 percent and 70 percent of the cases in
West Africa are unreported. Prior projections had estimated a much
higher figure. The study concludes that the epidemic is not as
difficult to control as feared, if rapid, vigorous contact tracing and
quarantines are employed.

Traditional contact tracing methods involve traveling to patients
and interviewing them. Online social networks and contact lists of
patients provide quick information about the kind of network and
travel patterns of patients. Patients with many contacts and an
active travel pattern can be quickly identified, allowing more efficient
use of scarce tracing personnel. Current manuals do not prescribe
taking online information into account. Apps are being developed to
ease the process of contact tracing.[11] We conclude that innovative
contact tracing methods such as analyzing online social networks,
mobile phone data and apps can speed up the process of contact
tracing, to better map the epidemiological spread.


2.4 Treatment & Drug Discovery 

The third challenge that I will discuss is related to the prevention of
Ebola. It concerns treatment and drug discovery. Pharmacologists
have developed a range of high performance drug discovery
techniques over the past years. They are used intensively to find a cure
for Ebola.

[10] CDC Methods for Implementing and Managing Contact Tracing for Ebola Virus
Disease in Less-Affected Countries, Centers for Disease Control & Prevention, 2014
[11] http://www.appsagainstebola.org
2.4.1 Treatment

At the time of writing there is no approved vaccine
for Ebola, despite a large effort by the pharmaceutical industry. In
addition, there is no cure or specific treatment that is currently
approved. Treatment is primarily supportive in nature, as survival
chances are improved by early care with rehydration and
symptomatic treatment. A number of experimental treatments are being
considered for use in the context of this outbreak, and are currently
in clinical trials. Patient data is recorded to understand the most
effective combination of therapies. In other diseases a well balanced
combination of symptomatic treatment has been shown to increase
both life expectancy and the quality of life of patients.
Transparent access to reliable patient records for doctors and scientists is
necessary for effective treatment development.

2.4.2 Drug Discovery 

Finding a preventive vaccine for Ebola is of prime importance.
According to a recent study by the US National Institute of Allergy
and Infectious Diseases the scientific community is still in the early
stages of understanding how infection with the Ebola virus can be
treated and prevented. Many Ebola vaccine candidates have been
developed in the decade prior to 2014, but none has yet been
approved for clinical use in humans. Several promising vaccine
candidates protect nonhuman primates (usually macaques) against lethal
infection, and some are now going through the clinical trial process.

The process of drug discovery has advanced to a state where
many steps have been automated. High throughput screening is
a method for scientific experimentation in drug discovery. It uses
robotics, data processing and control software, and includes
sensitive detectors and devices for handling liquids. High throughput
screening allows a researcher to quickly conduct millions of
chemical, genetic, or pharmacological tests. Researchers have developed
computational methods to analyze these test results.

Results from high throughput screening are used to refine
simulation models of the virus, in order to design a vaccine. Simulation
data can then be checked with in-vitro observations.

Pharmacology and molecular biology are active fields of research,
where many results on gene-disease findings and related drugs are
published. In addition to analyzing databases of molecules and
proteins, the publications themselves allow a drug discovery method
based on text mining and statistics. In this method textual
correlations in scientific papers are analyzed. A high textual correlation
indicates an increased possibility of a relation between molecules
and diseases, warranting further research. The advantages of such
in-silico drug discovery are (1) the low cost and (2) the systematic
nature of the search, allowing a much wider investigation of
acceptable relations than is possible with traditional methods.
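A very simple form of such textual correlation is co-occurrence: how often a molecule and a disease are mentioned in the same abstract. The sketch below uses a plain co-occurrence share as a stand-in for the statistical measures used in practice, which the lecture does not specify. The function name, the placeholder compound names, and the abstracts are hypothetical.

```python
from itertools import product

def cooccurrence_scores(abstracts, molecules, diseases):
    """For each (molecule, disease) pair, the share of abstracts
    mentioning the molecule that also mention the disease."""
    lowered = [text.lower() for text in abstracts]
    scores = {}
    for molecule, disease in product(molecules, diseases):
        with_molecule = [t for t in lowered if molecule.lower() in t]
        if not with_molecule:
            continue  # molecule never mentioned; no score possible
        both = sum(1 for t in with_molecule if disease.lower() in t)
        scores[(molecule, disease)] = both / len(with_molecule)
    return scores

# Hypothetical abstracts with placeholder compound names.
abstracts = [
    "Compound-X inhibits viral entry in Ebola models.",
    "Compound-X shows activity against Ebola in vitro.",
    "Compound-Y reduces fever in malaria patients.",
    "Compound-X was also tested in malaria.",
]
scores = cooccurrence_scores(
    abstracts, ["compound-x", "compound-y"], ["ebola", "malaria"]
)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A high score does not establish a biological relation; it only ranks candidate pairs for further research, in line with the systematic, low-cost character of the approach described above.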


2.5 Ebola: Three Challenges for Data Science 

At this point, we have discussed three challenges where data
science is helping to resolve the Ebola tragedy. The outbreak occurred
in countries with a poor health infrastructure, and a lack of reliable
data. Governments and organizations learned the importance of
opening up their data. Data scientists could then work on (1) better
methods for diagnosis, (2) new online epidemiological models, and
(3) developing vaccines and treatment methods.

Ad (1) Open data initiatives improve the quality of data about 
the outbreak. Novel methods such as smartphone self assessment 
apps have been developed, and the movement of people is analyzed 
based on data from mobile phones. 

Ad (2) New online epidemiological models are developed that help
simulate the spread of the disease based on data that is
continuously being updated. A relatively new area is the analysis of online
social networks and call information for contact tracing, to improve
the accuracy and efficiency of manual methods.

Ad (3) Pharmacologists are working hard to develop vaccines and
treatment drugs for Ebola, making use of high throughput drug
discovery methods and data analysis in trials.

In conclusion, data science has permeated the methods of doctors, 
aid workers, epidemiologists, and pharmacologists, helping them to 
fight the disease. 

Let us now look in more detail at the technologies that data
science is using. This allows us to understand future developments
for Ebola, and for other domains.



Chapter 3 

Data Science Technologies 


In the past, data collection and processing techniques were limited
in their power and versatility. In the last decade techniques have
progressed considerably. For Ebola a wide range of data sources are
used, such as mobile phone data, diagnostic app data, social
network data, and advanced mathematical models. Combining these
kinds of data requires new data processing technologies.

We will now describe three techniques in more detail. (1) For
diagnosis we will look at data quality and representation techniques,
(2) for epidemiological spread we will look at analysis techniques for
diverse and large data sets, and (3) for treatment we will look at high
performance data analysis techniques.


3.1 Data Quality & Representation 

I will start with techniques for data quality and representation that 
are used in the diagnosis part of the Ebola outbreak. 


3.1.1 Quality of Ebola Data 

An important aspect of data science is data quality. In many projects
the most time consuming task is ensuring the quality of the data:
cleaning the data, and checking for missing and inconsistent
values (see, e.g., Rahm & Do (2000)). For Ebola, gathering high quality
data is a difficult challenge, and alternative sources were sought,
such as mobile phone data and internet news postings. These
additional sources allow data to be triangulated so that the quality
increases. Also, data can be collected more quickly and is broader
in scope.
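Checking for missing and inconsistent values can be illustrated with a small audit routine. This is a sketch under assumptions: the record layout (district, date, cases, deaths), the two consistency rules, and the sample records are hypothetical and chosen only to show the kind of check involved.

```python
def audit_records(records, required=("district", "date", "cases", "deaths")):
    """Return (record index, problem) pairs for missing or
    inconsistent values in a list of case-count records."""
    problems = []
    for index, record in enumerate(records):
        for field in required:
            if record.get(field) in (None, ""):
                problems.append((index, f"missing {field}"))
        cases, deaths = record.get("cases"), record.get("deaths")
        if isinstance(cases, int) and isinstance(deaths, int):
            if cases < 0 or deaths < 0:
                problems.append((index, "negative count"))
            elif deaths > cases:
                problems.append((index, "deaths exceed cases"))
    return problems

# Hypothetical situation-report rows (not real data).
records = [
    {"district": "A", "date": "2014-09-01", "cases": 10, "deaths": 4},
    {"district": "B", "date": "", "cases": 5, "deaths": 7},
    {"district": "C", "date": "2014-09-01", "cases": None, "deaths": 1},
]
for index, problem in audit_records(records):
    print(index, problem)
```

Records that fail such checks are natural candidates for triangulation against a second source, such as the mobile phone or news data mentioned above.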


3.1.2 Knowledge Representation Techniques for Ebola 


In diagnosing Ebola, data from different sources is collected in
different data sets. It is in combining data from different areas where the
real power of data science lies: triangulating data to improve data
quality, and also, finding unexpected patterns. To be able to
compare items from different data sets, the data must be represented
in an organized and comparable manner. The field of knowledge
representation studies this aspect. It uses techniques such as
semantic networks and automated inferencing to organize knowledge
in taxonomies and ontologies. Semantic web techniques for linked
open data allow automatic inference of diverse kinds of data, such
as social network data (cf., Groth et al. (2012)). Social and semantic
network techniques are areas of active research. Their use in
helping to diagnose Ebola cases illustrates how fundamental research
and real world challenges can go together.


3.2 Analysis Techniques for Diverse & Large Data Sets

I will now discuss two techniques for analysis of diverse and large
data sets that are used in modeling the epidemiological spread of
the Ebola outbreak.


3.2.1 Diverse Data Sets 


Epidemiology makes good use of statistics and data analysis
techniques. Developers of statistical methods have a history of
standardizing their best algorithms into libraries and software packages.
Well known packages that are used in epidemiology are SPSS
(Meulman & Heiser 2001; Field 2009), Weka (Witten et al. 2011), and
R (R Core Team et al. 2012). These packages have paved the way
for the use of data analysis techniques in epidemiology and in other
sciences.

The data sources that are used for tracking the spread of the
2014 Ebola outbreak are diverse and go beyond traditional tables of
numerical data. Data can be text documents, sound, pictures, even
video, and data from motion sensors. Data can be dynamic, for
example an incoming stream of messages or video. Conventional,
linear, statistical methods are not suited to analyze the data from
the Ebola outbreak. Efforts to analyze this kind of high dimensional
data have yielded new statistical and machine learning techniques
(see, e.g., Hastie et al. (2009); Johnstone & Titterington (2009);
Takes (2014); Schraagen (2014)). As the Ebola case shows, still
more techniques are needed, and the advanced techniques must be
packaged in a way that is accessible for epidemiologists and other
scientists.


3.2.2 Large Data Sets 

Current data sets are larger than before, have a more diverse
structure than before, and change more frequently than before. Finding
answers in such large, unstructured, data sets requires intelligent
search algorithms that adapt to the search space at hand. Many
years ago, in Rotterdam and Edmonton, I started to work in this
field, as part of my PhD research.

For analyzing large data sets a variety of adaptive search
techniques exists, ranging from stochastic methods (see, e.g., Hoos &
Stützle (2004); Ruijl et al. (2014a,b)), multiple-objective
optimization (see, e.g., Koch et al.), evolutionary algorithms (see, e.g.,
Bäck et al. (2013); Bäck (2014)), to new versions of neural
networks [...] shown remarkable success, although many challenges
remain. In subsection 5.2 I will describe ideas for future research.

3.3 High Performance Drug Discovery Techniques 

In searching for vaccines, high performance data analysis
techniques are heavily used. I will briefly discuss techniques from high
performance computing, a field in which I worked as a postdoc, first
in Cambridge at MIT, and later in Amsterdam at the VU.

Quickly analyzing large data sets requires fast algorithms and
fast computers. Initially supercomputers were used for numerical
modeling, for applications such as computational fluid dynamics,
for weather prediction, and for simulations ranging from nuclear to
galactic processes.

In contrast, many of the drug discovery techniques for Ebola
involve classification and discrete choice (both for epidemiology and
for vaccine discovery). These problems require the application of
combinatorial methods, as used, for example, in route planning
problems, scheduling (see, e.g., Plaat (1996); Hoos & Stützle (2004)),
or for searching for relations between genes and diseases in large
databases. Together, methods from numerical and combinatorial
analysis comprise data science. There have been great advances
in high performance computing, combinatorial optimization, and
databases (see, e.g., Plaat et al. (2001); Boncz et al. (2008); Dean
& Ghemawat (2008); Seinstra et al. (2011); Engle et al. (2012);
Mirsoleimani et al. (2014)). These have enabled the application of
supercomputing to fields as diverse as the life sciences, the social
sciences, and the humanities.

Due to the increased need for data analysis the worldwide
demand for compute power is increasing sharply. In this respect it
is remarkable to see that the Netherlands' investment in scientific
compute power is not keeping pace. Our place in the worldwide
list of supercomputers, the TOP 500, is embarrassingly low, and
certainly not commensurate with that of a data economy.[1]


3.4 Data Science: Three Techniques for Ebola 

We have discussed three techniques that are used to resolve the
Ebola tragedy. These are techniques for (1) collecting high quality
data and organizing the data so that combinations between data
sets of a diverse structure can be made, (2) analyzing large
and diverse data sets, using adaptive techniques for high
dimensional data sets, and (3) high performance drug discovery.
High performance techniques are necessary since the size of
the data, especially when combinations are made, quickly becomes
too large for ordinary computers.


^ http://top500.org

Chapter 4 

Multidisciplinary Cooperation


We have now surveyed data science techniques that are used for
Ebola and that have changed the way in which the disease is handled.
For a moment we will digress and look at other applications,
outside the life sciences, in which data science is causing a similar
change. We will start with astronomy.


4.1 Astronomy 


In astronomy, the Low Frequency Array (LOFAR) radio telescope
consists of 25,000 small antennas that are spread out over a larger
area to effectively form one large virtual antenna (Rottgering et al.
2006; Haarlem et al. 2013). LOFAR's antennas together generate
so much raw data that it has to be reduced before it can be stored
for further processing and analysis (cf. De Vos et al. (2009)). A
dedicated supercomputer, BlueGene/L, has been built to do the signal
processing of LOFAR (Romein et al. 2006, 2011). LOFAR's design
has not only been made possible by advanced sensor technology,
but also by fast signal processing algorithms and large compute
power.


4.2 Physics 


In physics, particle experiments generate large amounts of data.
On the 4th of July of 2012 one of the most important scientific
discoveries in physics was announced: two independent experiments
reported results that were consistent with the detection of the Higgs
boson (Aad et al. 2012; Chatrchyan et al. 2012), the last elusive
particle from the standard model. A year later Peter Higgs was
awarded the Nobel prize, together with François Englert.
Calculations from almost 50 years ago, predicting the particle's
existence, had been proven correct.

It has been reported that around 100 Petabyte of data has been
generated in the Large Hadron Collider at CERN in these experiments.
To put that amount in perspective, my somewhat older
laptop has a storage capacity of 128 Gigabyte. Storing the data
held at CERN would require roughly 800,000 of those laptops.
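
The laptop comparison can be checked with a short back-of-the-envelope calculation. The sketch below assumes decimal units (1 Petabyte = 10^15 bytes, 1 Gigabyte = 10^9 bytes); the lecture does not say which convention is meant, and the variable names are only illustrative.

```python
# Back-of-the-envelope check of the laptop comparison.
# Assumption: decimal (SI) units; the text does not state the convention.
PETABYTE = 10**15  # bytes
GIGABYTE = 10**9   # bytes

cern_data_bytes = 100 * PETABYTE        # "around 100 Petabyte" (from the text)
laptop_capacity_bytes = 128 * GIGABYTE  # 128 Gigabyte laptop (from the text)

laptops_needed = cern_data_bytes / laptop_capacity_bytes
print(f"{laptops_needed:,.0f} laptops")  # 781,250 laptops
```

Under these assumptions the result is about 780,000 laptops, which rounds to the 800,000 quoted above.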

In addition to experimentalists, theorists too work with large
amounts of data. Ever since the 1960s theoretical physicists have
been using computers to manipulate large formulas to predict
experimental results. Veltman and 't Hooft used a special computer
algebra system for the calculation for which they received their
Nobel prize in 1999. Inspired by their system Jos Vermaseren
developed an improved system called FORM, to work with such large
formulas (Vermaseren 2000; Ueda & Vermaseren 2014).

For the next generation of particle experiments even more complex
calculations are required. In the HEPGAME project, for which
we gratefully acknowledge EU funding through the ERC Advanced
scheme, Jos Vermaseren, Jaap van den Herik and I work, with
our PhD students and postdocs, on advanced combinatorics and [...]
(Vermaseren et al. 2013; Ruijl et al. 2014b; Mirsoleimani et al.
2014).


^ http://press.web.cern.ch/press-releases/2011/12/atlas-and-cms-experiments-present-higgs-search-status
^ http://www.nikhef.nl/~form
^ http://hepgame.org


4.3 Law Enforcement 


One field with a natural interest in the behavior of its “customers”
is law enforcement. Activities to reduce terrorism, crime,
hooliganism, and jihadism are becoming increasingly driven by
information.

Data-driven methods are credited with modern policing successes
in Los Angeles, New York, and other cities. Our national police is
also gathering data to increase effectiveness, by correlating crime
figures with police actions. Intelligent blue, instead of more blue,
is the new motto. Many police forces are experimenting with
intelligence led policing (Meesters 2014). Other data-driven methods
have also shown success. Combining scenario methods with data
analytics can be used to anticipate criminal behavior to some
degree (De Kock 2014).


4.4 Commerce 

Data science is an important factor in the online and offline
economy. In bookselling (Amazon) and online video (YouTube, Netflix)
the volume of buying decisions and views allows statistically
significant personalized recommendations to be computed. These
recommendations drive a large share of sales. Data warehouses have
become core systems, for example, for calculating online ticket
prices in the hospitality and travel industry.


4.5 Regulation 


Increasingly we live our lives online, where expectations about
privacy may not hold. Technology is proving a difficult topic for
regulators that wish to protect our rights. As in all of our society,
moral and ethical issues arise, and research into the philosophical
and legal aspects of behavioral data collection is an important area
(see, e.g., Van den Berg & Leenes (2013); Van den Berg & van der Hof
(2012); Van Den Herik et al. (2014)). Active legal research is needed,
and legal scholars need to have an adequate technological
understanding (Prins 2014; Van den Berg & Keymolen 2013; Van der
Zwaan et al. 2014).

^ http://www.governing.com/topics/public-justice-safety/Data-driven-Policing.html










Chapter 5 

Outlook 


Having looked at how data science is used for Ebola and other
applications, it is now time to look into the future. First we will
discuss how data science improves the chances of preventing future
virus outbreaks. Next we will discuss plans for data science research.
Finally, we will look at data science in Leiden.


5.1 Future Ebola Outbreaks 


The medical history of conquering diseases is one of many
successes, although important challenges remain.

The loss of some 10,000 lives since the 2014 Ebola outbreak
and the ensuing human and social disruption are deeply tragic. The
impact of the outbreak is so large that it caused scientists,
volunteers, and the international community (1) to work around the
clock to improve diagnosis techniques for the disease, (2) to improve
the tracking of the spread of the disease, (3) to gather information
on patient treatment, and (4) to work on experimental vaccines.

Due to the urgency of the situation governments and organizations
opened up their data, and researchers used the latest data
science techniques in their approaches. Regardless of the reason,
novel medical and data science methods were created. The accuracy
of epidemiological models improved greatly, causing predictions to
be adjusted by more than an order of magnitude (see section 2.3.1).
Pharmacologists have been working hard to create vaccines. At the
time of this writing the first vaccines have been scheduled for phase
3 trials at the end of 2015.

Unless vaccines become available at a low cost and at a wide
scale, it is likely that Ebola incidents will continue to occur.
However, with open governments and the concerted effort from the
international scientific community, new diagnostic tools,
epidemiological models, and treatment and drug discovery methods have
been developed. Together these will likely prevent outbreaks of the
scale of 2014. Many of the lessons, such as advances in dynamic
epidemiological modeling, are likely to help in containing other
diseases as well.

The availability of open data and data science technology is
causing fundamental changes in our science and in our society,
as the Ebola case illustrates. Data science methods have become
indispensable in containing such an outbreak. The simple fact
that significantly more data can be measured, analyzed, and
understood has profound implications in all of science and society.


5.2 Future Developments in Data Science 

The data science vision is to measure more, to analyze more, and
to know more. It is used in policy making: evidence-based policy
making should lead to better decisions. It is used in policing:
intelligence-led policing will reduce crime. Better data on consumer
preferences allows marketers to create better recommendations and
advertisements. More data allows astronomers to uncover more about
our universe. By using large scale text analysis historians can
better understand historical developments. The Internet of Things will
cause a revolution in predictive maintenance. When the right
information is available humanitarian aid can be more effective. New
drugs can be discovered, outbreaks of diseases can be stopped, and
diseases may eventually be eradicated, when the right information
can be gleaned from the data.

Realizing this vision is dependent on science and technology. We
must be able: (1) to have high quality data that can be represented
and combined in a meaningful manner, (2) to analyze diverse and
large data sets, and (3) to do so quickly, using high performance
computing methods.

The goal of my chair is to create new data science methods, for
the large and diverse data sets that scientists are increasingly using.
Let us now look in more detail at the improvements that we will be
working on. We will start with data quality and representation.


5.2.1 Data Quality & Representation 


Open scientific data sources enable new forms of knowledge
discovery. Publication mining experiments can yield results at
significantly lower cost than traditional in-vitro experiments. The
work in the field of knowledge representation, which studies
taxonomies and information classification, is of great importance
here. At our center we are working on extending this to other
sciences, for example to text mining of historical texts and to crowd
sourcing for museum collections. Promising cooperations between
biosemantics groups, database groups and high performance computing
groups are under way.

We are in the fortunate position to have as part of our center one
of the leading high performance database systems researchers, who
works on MonetDB (Boncz et al. 2008). This is a great asset for all
our data management research projects.


5.2.2 Analysis Techniques for Diverse & Large Data Sets 


Advances in modeling drive statistical and computational techniques
for Ebola, and for data science. An example is the need for
validation of simulation results of multi-scale models (see, e.g.,
Portegies Zwart et al. (2013); Merks (2015)). The increased complexity
of these models will demand better validation methods and lead to
an increased need for observation data for initial values. In our
center we are working on machine analysis of numerical simulation
data, and on predictive maintenance. To compare the quality
of algorithms benchmarks are needed. In machine learning we see
benchmarking initiatives for data sets and algorithms, such as UCI
and OpenML. We will use OpenML in our algorithm development
work.

In many research projects a wide range of real world data is used,
from health (such as Ebola), to vibration data from bridges, to
financial data from mortgages and commerce. LUMC, IBL and LIACS are
working on knowledge representation and combinatorial optimization
for metagenomics applications. Combinatorial algorithms are
fruitfully applied in logistics operations in humanitarian aid, and
many other scientific applications that have planning and scheduling
challenges. In high performance combinatorial optimization a
fundamental move to parallel algorithms has occurred, for example
in high throughput drug discovery. This creates challenges for
algorithm designers and a strong need for formal verification methods.
I am looking forward to cooperation in this area.

^ https://archive.ics.uci.edu/ml/datasets.html
^ http://www.openml.org

At a more fundamental level, there is exciting work happening
in search space analysis and visualization (see, e.g., Verbeek
et al. (2007); Ochoa et al. (2014)); a possible combination with
solution trees is interesting (Plaat et al. 1994a,b). We will look for
relations between natural and heuristic optimization: finding common
elements in evolutionary approaches (Back 2014), deep neural
nets (Krizhevsky et al. 2012; Van den Berg 1996), pattern
recognition [...] ([...] 2013; Ruijl et al. 2014a), and classical
enumeration algorithms (see, e.g., Russell & Norvig (2011); Ruijl
et al. (2014a,b)). Recent experience with deep neural nets and
stochastic optimization suggests the feasibility of such relations
(see, e.g., Maddison et al. (2014); Clark & Storkey (2014); Mnih
et al. (2015)). Also, genetic and evolutionary algorithms are often
successfully applied to classic optimization problems (Izzo et al.
2013; David et al. 2014).

Such new methods will allow even more complex problems to be
analyzed and understood. Many of the successes are driven by real
world data and real world problems, such as the Ebola outbreak.
Promising application areas are the analysis of simulation output,
predictive maintenance, and drug discovery. Further application
fields are humanitarian aid, marketing, organizational behavior and
management (see, e.g., Plaat (2010)), financial analysis, sports
and, of course, health.


5.2.3 High Performance Computing 

High performance computing in Leiden has a rich history in areas
such as numerical computing, compiler technology, embedded systems,
distributed computing, and sparse matrix codes, with strong
groups throughout the Faculty of Science. The work in the DAS
projects (Seinstra et al. 2011) and the Little Green Machine is
state of the art. Data science needs a strong high performance
systems group and we will work to further strengthen this field.
Among the research topics are questions in compiler technology,
data sharing (see, e.g., Kielmann et al. (1999); Plaat et al. (2001)),
work and data scheduling (see, e.g., Romein et al. (2002); Kishimoto
et al. (2013)), accelerators such as GPGPU (see, e.g., Mirsoleimani
et al. (2014); Karami et al. (2014)), and other topics. Many
applications, for example, imaging and astrophysics simulations, can
benefit from these methods.


5.3 The Leiden Centre of Data Science 

Scientists and society have found that high performance analysis 
methods can solve some of their problems and answer some of their 
questions. In fields ranging from the humanities, to astronomy, to 
containing outbreaks of contagious diseases, they learn more and 
create new insights. 

The purpose of the Leiden Centre of Data Science is to solve real 
world problems in science and society. In the process, these efforts 
drive the invention of new data science techniques. The interest 
in data science is high, inside our university, and outside. The 
Leiden Centre of Data Science is an ideal network organization for 
cooperation with other universities, academies, and data science 
centers. We cooperate with national and local governments, with 
commercial companies, and with social and cultural institutes. We 
organize data science summer schools, labs, and regular academic 
courses. 

The purpose of the Centre is also to facilitate cooperation
between different disciplines. As such our focus is on community
building, on building a research infrastructure, and on initiating
projects with researchers in academia and industry. Since the
official opening of the Centre, less than a year ago, much has been
achieved, and the interest in data science has only grown.

^ http://www.littlegreenmachine.org


5.4 Conclusion 


Mathematics and computer science are impacting our modern lives
in many ways. The growing availability of data and data processing
technology is causing profound changes in science and society:
from the way that the Ebola outbreak is approached, to predictive
maintenance of bridges and infrastructure, to fine grained marketing,
to finding new drugs, to monitoring health behavior, to finding
social networks in ancient Chinese texts, to analyzing computational
fluid dynamics simulations, and to large scale simulations of
star systems. Patterns are everywhere, and by learning to find them
in large and diverse data sets we gain insights and solve problems.

From the perspective of data science the 2014 Ebola outbreak
is of special significance. Governments and organizations provided
open access to data, enabling the active creation of new data
science tools. New diagnosis techniques and better epidemiological
models have succeeded in reducing an aggressive epidemic faster
than would otherwise have been possible. Pharmacologists are using
high throughput methods that are increasing the chances of
finding a vaccine. As a result, there is legitimate hope that this
epidemic will soon be under control. Data science is helping to save
many lives: abstract notions from the worlds of statistics and
algorithms are having an effect that is anything but abstract. The
world is full of data, and data scientists are here to help.

At the start of this lecture I asked the question why there is so
much interest in data. Subsequently, we discussed some important
practical examples. Kurt Lewin once remarked that there is nothing
so practical as a good theory, and I agree. For data science, such a
theory was proposed six years ago. In 2009 data-intensive research
methods were named the Fourth Paradigm (Hey et al. 2009).
This term implies that data science complements the methods of
induction, deduction, and simulation, the three paradigms of the
experimental cycle. The term suggests that data science is as
fundamental as these scientific paradigms.

Data science gives scientists a new, disruptive way to look for
answers to their questions. For its consequences in theory and
practice, we welcome data as a new paradigm to science.